February 19, 2018

Introduction

  • Supervised
  • Primarily categorical, but numeric also
  • Nonlinear
  • Split on features step by step to reduce disorder (uncertainty) or to gain the most information
  • Decision criterion
    – Information entropy (Wikipedia)
    – Gini index (Wikipedia)

  • Algorithm: top down, greedy, binary tree
    – ID3: Iterative Dichotomiser 3
    – C4.5 & C5.0
    – CART: classification and regression tree (1984)
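Both criteria score a node's impurity from its class proportions. A minimal base-R sketch of the two measures (the function names are ours, not from a package):

```r
# Impurity measures on a vector of class counts
entropy <- function(counts) {          # information entropy, log base 2
  p <- counts / sum(counts)
  p <- p[p > 0]                        # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

gini <- function(counts) {             # Gini index: 1 - sum of squared proportions
  p <- counts / sum(counts)
  1 - sum(p^2)
}

entropy(c(9, 5))   # ~0.940 -- the PlayTennis root node used below
gini(c(9, 5))      # ~0.459
entropy(c(7, 7))   # 1 -- maximal disorder for two classes
```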

PlayTennis Example

PlayTennis Data set

PlayTennis Decision Tree

  • The root node starts the tree
  • An internal node tests a feature
  • Each branch is a feature value
  • Each leaf node assigns an observation to a category

ML Backgrounder Time Out: Information Entropy

First split

Entropy

## [1] " play tennis entropy 0.940"
## [1] " split on outlook: entropy 0.694"
## [1] " split on temperature: entropy 0.911"
## [1] " split on humidity: entropy 0.788"
## [1] " split on wind: entropy 0.892"

Information Gain = entropy before the split - weighted average entropy of the child nodes after the split

## [1] " split on outlook: information gain 0.247"
## [1] " split on temperature: information gain 0.029"
## [1] " split on humidity: information gain 0.152"
## [1] " split on wind: information gain 0.048"

Split on sunny


## [1] " sunny entropy 0.971"
## [1] " split on temperature: entropy 0.400"
## [1] " split on humidity: entropy 0.000"
## [1] " split on wind: entropy 0.951"

Now use R

tennis <- read.csv("tennis.csv", stringsAsFactors=TRUE, header = TRUE)
str(tennis)
## 'data.frame':    14 obs. of  6 variables:
##  $ Day     : Factor w/ 14 levels "D1","D10","D11",..: 1 7 8 9 10 11 12 13 14 2 ...
##  $ Outlook : Factor w/ 3 levels "Overcast","Rain",..: 3 3 1 2 2 2 1 3 3 2 ...
##  $ Temp    : Factor w/ 3 levels "Cool","Hot","Mild": 2 2 2 3 1 1 1 3 1 3 ...
##  $ Humidity: Factor w/ 2 levels "High","Normal": 1 1 1 1 2 2 2 1 2 2 ...
##  $ Wind    : Factor w/ 2 levels "Strong","Weak": 2 1 2 2 2 1 1 2 2 2 ...
##  $ Play    : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 1 2 2 ...

tennis$Day = NULL   # drop the Day identifier; it carries no predictive information
head(tennis) ; lf() # lf() is a custom helper that prints a blank line
##    Outlook Temp Humidity   Wind Play
## 1    Sunny  Hot     High   Weak   No
## 2    Sunny  Hot     High Strong   No
## 3 Overcast  Hot     High   Weak  Yes
## 4     Rain Mild     High   Weak  Yes
## 5     Rain Cool   Normal   Weak  Yes
## 6     Rain Cool   Normal Strong   No
printDim(tennis)
## [1] "nrow=  14"  "  ncol=  5"

summary(tennis) ; lf()
##      Outlook    Temp     Humidity     Wind    Play  
##  Overcast:4   Cool:4   High  :7   Strong:6   No :5  
##  Rain    :5   Hot :4   Normal:7   Weak  :8   Yes:9  
##  Sunny   :5   Mild:6
str(tennis) ; lf()
## 'data.frame':    14 obs. of  5 variables:
##  $ Outlook : Factor w/ 3 levels "Overcast","Rain",..: 3 3 1 2 2 2 1 3 3 2 ...
##  $ Temp    : Factor w/ 3 levels "Cool","Hot","Mild": 2 2 2 3 1 1 1 3 1 3 ...
##  $ Humidity: Factor w/ 2 levels "High","Normal": 1 1 1 1 2 2 2 1 2 2 ...
##  $ Wind    : Factor w/ 2 levels "Strong","Weak": 2 1 2 2 2 1 1 2 2 2 ...
##  $ Play    : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 1 2 2 ...

f = MakeFormula(tennis, 'Play') ; f   # custom helper: builds Play ~ <all other columns>
## Play ~ Outlook + Temp + Humidity + Wind
## <environment: 0x7fe1580d5b50>
library(rpart) ; library(rpart.plot)   # rpart fits the tree; rpart.plot draws it
treeModel <- rpart(f, data = tennis,
                   method  = "class",                     # classification tree
                   parms   = list(split = "information"), # entropy-based splitting
                   control = rpart.control(minsplit = 2, minbucket = 1,
                                           cp = 0.001, maxdepth = 6) )

attributes(treeModel)
## $names
##  [1] "frame"               "where"               "call"               
##  [4] "terms"               "cptable"             "method"             
##  [7] "parms"               "control"             "functions"          
## [10] "numresp"             "splits"              "csplit"             
## [13] "variable.importance" "y"                   "ordered"            
## 
## $xlevels
## $xlevels$Outlook
## [1] "Overcast" "Rain"     "Sunny"   
## 
## $xlevels$Temp
## [1] "Cool" "Hot"  "Mild"
## 
## $xlevels$Humidity
## [1] "High"   "Normal"
## 
## $xlevels$Wind
## [1] "Strong" "Weak"  
## 
## 
## $ylevels
## [1] "No"  "Yes"
## 
## $class
## [1] "rpart"

summary(treeModel)
## Call:
## rpart(formula = f, data = tennis, method = "class", parms = list(split = "information"), 
##     control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001, 
##         maxdepth = 6))
##   n= 14 
## 
##      CP nsplit rel error xerror      xstd
## 1 0.300      0       1.0    1.0 0.3585686
## 2 0.100      2       0.4    1.2 0.3703280
## 3 0.001      6       0.0    1.2 0.3703280
## 
## Variable importance
##  Outlook     Wind Humidity     Temp 
##       48       23       18       11 
## 
## Node number 1: 14 observations,    complexity param=0.3
##   predicted class=Yes  expected loss=0.3571429  P(node) =1
##     class counts:     5     9
##    probabilities: 0.357 0.643 
##   left son=2 (10 obs) right son=3 (4 obs)
##   Primary splits:
##       Outlook  splits as  RLL, improve=2.1931200, (0 missing)
##       Humidity splits as  LR,  improve=1.4734210, (0 missing)
##       Wind     splits as  LR,  improve=0.4670276, (0 missing)
##       Temp     splits as  RLR, improve=0.2433601, (0 missing)
## 
## Node number 2: 10 observations,    complexity param=0.3
##   predicted class=No   expected loss=0.5  P(node) =0.7142857
##     class counts:     5     5
##    probabilities: 0.500 0.500 
##   left son=4 (5 obs) right son=5 (5 obs)
##   Primary splits:
##       Humidity splits as  LR,  improve=1.9274480, (0 missing)
##       Temp     splits as  RLR, improve=1.6389660, (0 missing)
##       Wind     splits as  LR,  improve=0.8630462, (0 missing)
##       Outlook  splits as  -RL, improve=0.2013551, (0 missing)
##   Surrogate splits:
##       Temp    splits as  RLL, agree=0.8, adj=0.6, (0 split)
##       Outlook splits as  -RL, agree=0.6, adj=0.2, (0 split)
## 
## Node number 3: 4 observations
##   predicted class=Yes  expected loss=0  P(node) =0.2857143
##     class counts:     0     4
##    probabilities: 0.000 1.000 
## 
## Node number 4: 5 observations,    complexity param=0.1
##   predicted class=No   expected loss=0.2  P(node) =0.3571429
##     class counts:     4     1
##    probabilities: 0.800 0.200 
##   left son=8 (3 obs) right son=9 (2 obs)
##   Primary splits:
##       Outlook splits as  -RL, improve=1.1157180, (0 missing)
##       Temp    splits as  -LR, improve=0.5924696, (0 missing)
##       Wind    splits as  LR,  improve=0.5924696, (0 missing)
## 
## Node number 5: 5 observations,    complexity param=0.1
##   predicted class=Yes  expected loss=0.2  P(node) =0.3571429
##     class counts:     1     4
##    probabilities: 0.200 0.800 
##   left son=10 (2 obs) right son=11 (3 obs)
##   Primary splits:
##       Wind    splits as  LR,  improve=1.1157180, (0 missing)
##       Outlook splits as  -LR, improve=0.5924696, (0 missing)
##       Temp    splits as  L-R, improve=0.5924696, (0 missing)
## 
## Node number 8: 3 observations
##   predicted class=No   expected loss=0  P(node) =0.2142857
##     class counts:     3     0
##    probabilities: 1.000 0.000 
## 
## Node number 9: 2 observations,    complexity param=0.1
##   predicted class=No   expected loss=0.5  P(node) =0.1428571
##     class counts:     1     1
##    probabilities: 0.500 0.500 
##   left son=18 (1 obs) right son=19 (1 obs)
##   Primary splits:
##       Wind splits as  LR, improve=1.386294, (0 missing)
## 
## Node number 10: 2 observations,    complexity param=0.1
##   predicted class=No   expected loss=0.5  P(node) =0.1428571
##     class counts:     1     1
##    probabilities: 0.500 0.500 
##   left son=20 (1 obs) right son=21 (1 obs)
##   Primary splits:
##       Outlook splits as  -LR, improve=1.386294, (0 missing)
##       Temp    splits as  L-R, improve=1.386294, (0 missing)
## 
## Node number 11: 3 observations
##   predicted class=Yes  expected loss=0  P(node) =0.2142857
##     class counts:     0     3
##    probabilities: 0.000 1.000 
## 
## Node number 18: 1 observations
##   predicted class=No   expected loss=0  P(node) =0.07142857
##     class counts:     1     0
##    probabilities: 1.000 0.000 
## 
## Node number 19: 1 observations
##   predicted class=Yes  expected loss=0  P(node) =0.07142857
##     class counts:     0     1
##    probabilities: 0.000 1.000 
## 
## Node number 20: 1 observations
##   predicted class=No   expected loss=0  P(node) =0.07142857
##     class counts:     1     0
##    probabilities: 1.000 0.000 
## 
## Node number 21: 1 observations
##   predicted class=Yes  expected loss=0  P(node) =0.07142857
##     class counts:     0     1
##    probabilities: 0.000 1.000

rpart.plot

rpart.plot(treeModel)

rpart makes only binary splits, so its tree won't exactly match our hand-made multiway tree

rpart.plot(treeModel, type =4, extra = 1)

rpart.plot(treeModel, type =4, extra = 2, clip.right.labs = FALSE, varlen = 0, faclen = 0)

Predict

test = read.csv("tennisTest.csv", stringsAsFactors=TRUE, header = TRUE)
str(test)
## 'data.frame':    2 obs. of  4 variables:
##  $ Outlook : Factor w/ 2 levels "Rain","Sunny": 2 1
##  $ Temp    : Factor w/ 1 level "Mild": 1 1
##  $ Humidity: Factor w/ 2 levels "High","Normal": 2 1
##  $ Wind    : Factor w/ 2 levels "Strong","Weak": 1 2
head(test)
##   Outlook Temp Humidity   Wind
## 1   Sunny Mild   Normal Strong
## 2    Rain Mild     High   Weak

lineFeed('prediction')
## 
##  prediction
predict(treeModel, newdata = test, type = 'class')
##   1   2 
## Yes Yes 
## Levels: No Yes

Regression

ISLR Hitters

ISLR, pages 304-305
Data from the 1986 and 1987 seasons

require(ISLR)
#install.packages('ISLR',repos='http://cran.us.r-project.org')
dim(Hitters)
## [1] 322  20
head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## -Andy Allanson      293   66     1   30  29    14     1    293    66
## -Alan Ashby         315   81     7   24  38    39    14   3449   835
## -Alvin Davis        479  130    18   66  72    76     3   1624   457
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575
## -Andres Galarraga   321   87    10   39  42    30     2    396   101
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133
##                   CHmRun CRuns CRBI CWalks League Division PutOuts Assists
## -Andy Allanson         1    30   29     14      A        E     446      33
## -Alan Ashby           69   321  414    375      N        W     632      43
## -Alvin Davis          63   224  266    263      A        W     880      82
## -Andre Dawson        225   828  838    354      N        E     200      11
## -Andres Galarraga     12    48   46     33      N        E     805      40
## -Alfredo Griffin      19   501  336    194      A        W     282     421
##                   Errors Salary NewLeague
## -Andy Allanson        20     NA         A
## -Alan Ashby           10  475.0         N
## -Alvin Davis          14  480.0         A
## -Andre Dawson          3  500.0         N
## -Andres Galarraga      4   91.5         N
## -Alfredo Griffin      25  750.0         A
hitter = na.omit(Hitters)   # drop rows with missing values (Salary has NAs)

A leading C (CAtBat, CHits, ...) means a career total

Max depth = 6

# f = MakeFormula(Hitters, 'Salary') ; f 
hitterModel <- rpart(Salary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + 
                CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + League + 
                Division + PutOuts + Assists + Errors + NewLeague,
                data=hitter, method="anova",
                parms = list(split = "information"),
                control = rpart.control(minsplit = 10, minbucket = 10, cp = 0.01, maxdepth = 6) )

rpart.plot(hitterModel, type =4, clip.right.labs = FALSE, varlen = 0, faclen = 0)

Max depth = 3

hitterModel <- rpart(Salary ~ Hits + Years,
                data=hitter, method="anova",
                parms = list(split = "information"),
                control = rpart.control(minsplit = 10, minbucket = 10, cp = 0.01, maxdepth = 3) )

rpart.plot(hitterModel, type =4, clip.right.labs = FALSE, varlen = 0, faclen = 0)

Max depth = 2

hitterModel <- rpart(log(Salary) ~ Hits + Years,
                data=hitter, method="anova",
                parms = list(split = "information"),
                control = rpart.control(minsplit = 20, minbucket = 20, cp = 0.01, maxdepth = 2) )

rpart.plot(hitterModel, type =4, clip.right.labs = FALSE, varlen = 0, faclen = 0)
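With method = "anova", rpart chooses each split to minimize the summed squared error of the two children. A self-contained base-R sketch of that search on synthetic data (best_split() is illustrative, not an rpart function):

```r
# Exhaustive search for the best single split point on one predictor,
# minimizing total within-child sum of squared errors (what "anova" splits do).
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- head(xs, -1) + diff(xs) / 2          # midpoints between observed values
  sse  <- sapply(cuts, function(cut) {
    l <- y[x <= cut]; r <- y[x > cut]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

set.seed(1)
x <- 1:20
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 15))  # level shift after x = 10
best_split(x, y)$cut                               # recovers the cut at 10.5
```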

Scatterplot

library(RColorBrewer)
# c = seq(0,1,length.out=6)
col = cut(hitter$Salary,
          breaks = quantile(hitter$Salary, c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)),
          include.lowest = TRUE)   # keep the minimum salary inside the first bin
color = colorRampPalette(brewer.pal(6, "Reds"))(nlevels(col))[col]   # one shade per salary quintile
plot(Hits ~ Years, data = hitter, col = color, pch = 15)
abline(v = 4.5, col = 'purple', lwd = 2 )
lines(c(4.5,25), c(118,118), col = 'purple', lwd = 2)

Home Grown Regression
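The data frame d and the palette colors used below are created off-slide. A plausible reconstruction, assumed here so the calls run (three response levels separated near x = 4.6 and y = 4.7, matching the purple lines in the plots):

```r
# Assumed reconstruction of d -- not the original data
set.seed(42)
n <- 150
d <- data.frame(x = runif(n, 0, 10), y = runif(n, 0, 10))
d$z <- ifelse(d$x < 4.6, 1, ifelse(d$y < 4.7, 2, 3))   # numeric response for method = 'anova'
colors <- c("red", "green3", "blue")                   # one color per level of z
```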

model = rpart(z ~ x + y, data = d, method = 'anova')   # d: a synthetic (x, y, z) data set
rpart.plot(model, type =4, clip.right.labs = FALSE, varlen = 0, faclen = 0)

plot(d$x,d$y, pch = 15, col = colors[d$z], xlim = c(0,10), ylim=c(0,10), xlab = 'x', ylab = 'y') ; grid()
abline(v = 4.6, col = 'purple', lwd = 2)

plot(d$x,d$y, pch = 15, col = colors[d$z], xlim = c(0,10), ylim=c(0,10), xlab = 'x', ylab = 'y') ; grid()
abline(v = 4.6, col = 'purple', lwd = 2)
lines(c(4.6,10),c(4.7,4.7), col = 'purple', lwd = 2)

Titanic Revisited

Read training and tests set from Kaggle

train <- read.csv("TitanicTrain.csv", stringsAsFactors=TRUE)
test <- read.csv("TitanicTest.csv", stringsAsFactors=TRUE)
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Gender     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
train$Name = NULL ; test$Name = NULL
train$Ticket = NULL ; test$Ticket = NULL
train$Cabin = NULL ; test$Cabin = NULL
train$Family = train$SibSp + train$Parch ; test$Family = test$SibSp + test$Parch
train$Survived = as.factor(train$Survived) 
str(train$Pclass) ; table(train$Pclass) ; summary(train$Pclass)
##  int [1:891] 3 1 3 1 3 3 1 3 3 2 ...
## 
##   1   2   3 
## 216 184 491
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.309   3.000   3.000
train$Pclass = factor(train$Pclass, levels = c(1,2,3), labels = c("P1", "P2", "P3"))
str(train$Pclass) ; table(train$Pclass) ; summary(train$Pclass)
##  Factor w/ 3 levels "P1","P2","P3": 3 1 3 1 3 3 1 3 3 2 ...
## 
##  P1  P2  P3 
## 216 184 491
##  P1  P2  P3 
## 216 184 491
test$Pclass = factor(test$Pclass, levels = c(1,2,3), labels = c("P1", "P2", "P3"))

Replace the blanks in Embarked with the most common port, 'S'. (The assignment changes the values, but the unused "" level stays in the factor, which is why four levels still show below.)

lineFeed('before') ; table(train$Embarked) 
## 
##  before
## 
##       C   Q   S 
##   2 168  77 644
train$Embarked[train$Embarked == ""] = 'S'
lineFeed('after') ; table(train$Embarked) 
## 
##  after
## 
##       C   Q   S 
##   0 168  77 646
str(train$Embarked)
##  Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
summary(train$Embarked)
##       C   Q   S 
##   0 168  77 646
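Why the zero-count "" column persists: reassigning values does not remove a factor level. A small self-contained demonstration; droplevels() is the fix if the empty level should go away:

```r
emb <- factor(c("", "C", "Q", "S", "S"))
emb[emb == ""] <- "S"    # values fixed...
levels(emb)              # ...but "" is still a level: "" "C" "Q" "S"
emb <- droplevels(emb)
levels(emb)              # "C" "Q" "S"
```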

str(train)
## 'data.frame':    891 obs. of  10 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "P1","P2","P3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Gender     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Family     : int  1 1 0 1 0 0 0 4 2 1 ...

## 
##  Survived in training set
## 
##   0   1 
## 549 342
## 
##  Gender in training set
## 
## female   male 
##    314    577

## 
##  Passenger class in training set
## 
##  P1  P2  P3 
## 216 184 491

## 
##  Training set Gender v. Survived
##         Survived
## Gender     0   1
##   female  81 233
##   male   468 109
## 
##  Training set Passenger Class v. Survived
##       Survived
## Pclass   0   1
##     P1  80 136
##     P2  97  87
##     P3 372 119

names(train) ; lf()
##  [1] "PassengerId" "Survived"    "Pclass"      "Gender"      "Age"        
##  [6] "SibSp"       "Parch"       "Fare"        "Embarked"    "Family"
psi('missing ages',sum(is.na(train$Age)))
## [1] "missing ages 177"
pNA <- function(x){sum(is.na(x))/length(x)}
p = apply(train,2,pNA)
psp('missing ages',p[5],2)
## [1] "missing ages 19.87 %"

train$ageNA = as.factor(ifelse(is.na(train$Age),'yes','no'))
test$ageNA = as.factor(ifelse(is.na(test$Age),'yes','no'))
lineFeed('train') ; summary(train$ageNA) ; lineFeed('test') ; summary(test$ageNA) 
## 
##  train
##  no yes 
## 714 177
## 
##  test
##  no yes 
## 332  86

Use kNN() from the VIM package

library(VIM)   # provides kNN() imputation
impTrain = kNN(train, variable = "Age", k = 5)
impTest  = kNN(test,  variable = "Age", k = 5)

mice() and rpart() were also tried for imputing the missing ages

Neither improved on kNN(), so we stay with kNN()

summary(impTrain$Age) ; lf() ; summary(impTest$Age) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   28.00   29.53   37.00   80.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.17   22.00   26.00   29.30   36.00   76.00
impTrain$Age_imp = NULL ; impTest$Age_imp = NULL   # drop kNN's imputation-indicator columns

Training set

str(impTrain) 
## 'data.frame':    891 obs. of  11 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "P1","P2","P3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Gender     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 20 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Family     : int  1 1 0 1 0 0 0 4 2 1 ...
##  $ ageNA      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...

Testing set

str(impTest) 
## 'data.frame':    418 obs. of  10 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : Factor w/ 3 levels "P1","P2","P3": 3 3 2 3 3 3 3 2 3 3 ...
##  $ Gender     : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
##  $ Family     : int  0 1 0 0 2 0 0 2 0 2 ...
##  $ ageNA      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

form = MakeFormula(train,"Survived") ; form
## Survived ~ PassengerId + Pclass + Gender + Age + SibSp + Parch + 
##     Fare + Embarked + Family + ageNA
## <environment: 0x7fe15eb93840>
titanicModel <- rpart(form, 
                data=impTrain, 
                method="class",
                parms = list(split = "information"),
                control = rpart.control(minsplit = 10, minbucket = 10, cp = 0.02, maxdepth = 5) )

rpart.plot(titanicModel, type =4, extra = 2, clip.right.labs = FALSE, varlen = 0, faclen = 0)

Try caret's train() to tune the complexity parameter

No improvement here, though this is only a minimal use of caret

library(caret)
CVcontrol = trainControl(method = "cv", number = 10)  
model = train( impTrain[,3:11], impTrain[,2], 
                method = "rpart", tuneLength = 10, trControl = CVcontrol)
titanicModel = model$finalModel
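For reference, the train object keeps the cross-validated grid and the winning tuning value; these are standard caret fields (a sketch, output omitted):

```r
model$results[, c("cp", "Accuracy")]   # CV accuracy for each candidate cp
model$bestTune                         # the cp that finalModel was fit with
```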

rpart.plot(titanicModel)

Won't plot on a slide, but does plot in the console.

pred = predict(titanicModel, newdata = impTest, type = 'class')   # Kaggle scores 0/1 labels
testPred = data.frame(PassengerId = test$PassengerId, Survived = pred)
write.csv(testPred, file = "testPred.csv", row.names = FALSE)

Framingham Heart Example

library(caTools)   # provides sample.split()
set.seed(101)
frame = read.csv("/Users/davidkeck/Desktop/RAIK 370/Data/framingham.csv")
names(frame) ; printDim(frame)
##  [1] "male"            "age"             "education"      
##  [4] "currentSmoker"   "cigsPerDay"      "BPMeds"         
##  [7] "prevalentStroke" "prevalentHyp"    "diabetes"       
## [10] "totChol"         "sysBP"           "diaBP"          
## [13] "BMI"             "heartRate"       "glucose"        
## [16] "TenYearCHD"
## [1] "nrow=  4240" "  ncol=  16"
spl = sample.split(frame$TenYearCHD, SplitRatio = 0.75)
train = subset(frame, spl==TRUE)
test = subset(frame, spl==FALSE)
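sample.split() stratifies on the outcome, so both partitions keep roughly the same ~15% CHD rate. A self-contained check with a synthetic 0/1 vector (the 850/150 counts are illustrative, chosen to mimic TenYearCHD):

```r
library(caTools)
set.seed(101)
y   <- rep(c(0, 1), times = c(850, 150))       # 15% positives
spl <- sample.split(y, SplitRatio = 0.75)
c(train = mean(y[spl]), test = mean(y[!spl]))  # both near 0.15
```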

summary(frame)
##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.979   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##                                   NA's   :105                     
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.006   Mean   :0.02962   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##  NA's   :29       NA's   :53                                           
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##                    NA's   :50                                     
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.96   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##  NA's   :19      NA's   :1        NA's   :388

impFrame = kNN(frame, k = 5)   # VIM's kNN() imputes every column that has NAs

All NAs removed via kNN imputation

summary(impFrame)
##       male             age          education     currentSmoker   
##  Min.   :0.0000   Min.   :32.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :49.00   Median :2.000   Median :0.0000  
##  Mean   :0.4292   Mean   :49.58   Mean   :1.978   Mean   :0.4941  
##  3rd Qu.:1.0000   3rd Qu.:56.00   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :70.00   Max.   :4.000   Max.   :1.0000  
##    cigsPerDay         BPMeds        prevalentStroke     prevalentHyp   
##  Min.   : 0.000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.: 0.000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median : 0.000   Median :0.00000   Median :0.000000   Median :0.0000  
##  Mean   : 9.075   Mean   :0.02948   Mean   :0.005896   Mean   :0.3106  
##  3rd Qu.:20.000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :70.000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000  
##     diabetes          totChol          sysBP           diaBP      
##  Min.   :0.00000   Min.   :107.0   Min.   : 83.5   Min.   : 48.0  
##  1st Qu.:0.00000   1st Qu.:206.0   1st Qu.:117.0   1st Qu.: 75.0  
##  Median :0.00000   Median :234.0   Median :128.0   Median : 82.0  
##  Mean   :0.02571   Mean   :236.7   Mean   :132.4   Mean   : 82.9  
##  3rd Qu.:0.00000   3rd Qu.:263.0   3rd Qu.:144.0   3rd Qu.: 90.0  
##  Max.   :1.00000   Max.   :696.0   Max.   :295.0   Max.   :142.5  
##       BMI          heartRate         glucose         TenYearCHD    
##  Min.   :15.54   Min.   : 44.00   Min.   : 40.00   Min.   :0.0000  
##  1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 72.00   1st Qu.:0.0000  
##  Median :25.40   Median : 75.00   Median : 78.00   Median :0.0000  
##  Mean   :25.80   Mean   : 75.88   Mean   : 81.64   Mean   :0.1519  
##  3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 86.00   3rd Qu.:0.0000  
##  Max.   :56.80   Max.   :143.00   Max.   :394.00   Max.   :1.0000  
##   male_imp        age_imp        education_imp   currentSmoker_imp
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical    
##  FALSE:4240      FALSE:4240      FALSE:4135      FALSE:4240       
##                                  TRUE :105                        
##                                                                   
##                                                                   
##                                                                   
##  cigsPerDay_imp  BPMeds_imp      prevalentStroke_imp prevalentHyp_imp
##  Mode :logical   Mode :logical   Mode :logical       Mode :logical   
##  FALSE:4211      FALSE:4187      FALSE:4240          FALSE:4240      
##  TRUE :29        TRUE :53                                            
##                                                                      
##                                                                      
##                                                                      
##  diabetes_imp    totChol_imp     sysBP_imp       diaBP_imp      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:4240      FALSE:4190      FALSE:4240      FALSE:4240     
##                  TRUE :50                                       
##                                                                 
##                                                                 
##                                                                 
##   BMI_imp        heartRate_imp   glucose_imp     TenYearCHD_imp 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:4221      FALSE:4239      FALSE:3852      FALSE:4240     
##  TRUE :19        TRUE :1         TRUE :388                      
##                                                                 
##                                                                 
## 
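Before fitting a model on impFrame, note that kNN() appended a logical *_imp indicator column for every variable (visible at the end of the summary above). A one-line cleanup, assuming kNN's default _imp suffix:

```r
# Drop the *_imp indicator columns so they are not used as predictors
impFrame <- impFrame[, !grepl("_imp$", names(impFrame))]
names(impFrame)   # the original 16 columns only
```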

Court Decisions Example

stevens = read.csv("stevens.csv")
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
stevens$Docket = NULL ; stevens$Term = NULL
printDim(stevens)
## [1] "nrow=  566" "  ncol=  7"

library(caTools)
set.seed(101)
spl = sample.split(stevens$Reverse, SplitRatio = 0.8)
train = subset(stevens, spl==TRUE)
test = subset(stevens, spl==FALSE)
str(stevens)  
## 'data.frame':    566 obs. of  7 variables:
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
summary(stevens)
##     Circuit                  Issue                   Petitioner 
##  9th    :122   CriminalProcedure:132   OTHER              :175  
##  5th    : 53   JudicialPower    :102   CRIMINAL.DEFENDENT : 89  
##  11th   : 49   EconomicActivity : 98   BUSINESS           : 79  
##  7th    : 47   CivilRights      : 74   STATE              : 48  
##  4th    : 46   DueProcess       : 43   US                 : 48  
##  8th    : 44   FirstAmendment   : 39   GOVERNMENT.OFFICIAL: 38  
##  (Other):205   (Other)          : 78   (Other)            : 89  
##               Respondent    LowerCourt     Unconst          Reverse      
##  OTHER             :177   conser :293   Min.   :0.0000   Min.   :0.0000  
##  BUSINESS          : 80   liberal:273   1st Qu.:0.0000   1st Qu.:0.0000  
##  US                : 69                 Median :0.0000   Median :1.0000  
##  CRIMINAL.DEFENDENT: 58                 Mean   :0.2473   Mean   :0.5459  
##  STATE             : 56                 3rd Qu.:0.0000   3rd Qu.:1.0000  
##  EMPLOYEE          : 28                 Max.   :1.0000   Max.   :1.0000  
##  (Other)           : 98

Boston Housing Values Example

Reference

Features

boston = read.csv("boston.csv")  # Find UCI  
printDim(boston)
## [1] "nrow=  506"  "  ncol=  16"

summary(boston)
##                 TOWN         TRACT           LON              LAT       
##  Cambridge        : 30   Min.   :   1   Min.   :-71.29   Min.   :42.03  
##  Boston Savin Hill: 23   1st Qu.:1303   1st Qu.:-71.09   1st Qu.:42.18  
##  Lynn             : 22   Median :3394   Median :-71.05   Median :42.22  
##  Boston Roxbury   : 19   Mean   :2700   Mean   :-71.06   Mean   :42.22  
##  Newton           : 18   3rd Qu.:3740   3rd Qu.:-71.02   3rd Qu.:42.25  
##  Somerville       : 15   Max.   :5082   Max.   :-70.81   Max.   :42.38  
##  (Other)          :379                                                  
##       MEDV            CRIM                ZN             INDUS      
##  Min.   : 5.00   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
##  1st Qu.:17.02   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
##  Median :21.20   Median : 0.25651   Median :  0.00   Median : 9.69  
##  Mean   :22.53   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
##  3rd Qu.:25.00   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
##  Max.   :50.00   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
##                                                                     
##       CHAS              NOX               RM             AGE        
##  Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
##  1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
##  Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
##  Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
##  3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
##  Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
##                                                                     
##       DIS              RAD              TAX           PTRATIO     
##  Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
##  Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
## 

Random Forest Specialization

Decision Trees

  • Pruning
  • Bagging
  • Boosting
  • Cross validation
  • caret package
  • Ensemble

Random Forest